Introduction

What Is the Grammar of Graphics?

  1. Data that you want to visualise and a set of aesthetic mappings describing how variables in the data are mapped to aesthetic attributes that you can perceive.

  2. Layers made up of geometric elements and statistical transformations. Geometric objects, geoms for short, represent what you actually see on the plot: points, lines, polygons, etc. Statistical transformations, stats for short, summarise data in many useful ways, for example by binning and counting observations to create a histogram, or by summarising a 2d relationship with a linear model.

  3. The scales map values in the data space to values in an aesthetic space, whether it be colour, or size, or shape. Scales draw a legend or axes, which provide an inverse mapping to make it possible to read the original data values from the plot.

  4. A coordinate system, coord for short, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to make it possible to read the graph. We normally use a Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.

  5. A faceting specification describes how to break up the data into subsets and how to display those subsets as small multiples. This is also known as conditioning or latticing/trellising.

  6. A theme which controls the finer points of display, like the font size and background colour. While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot. A good starting place is Tufte’s early works (Tufte, 1990, 1997, 2001).
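As a rough sketch of how the six components above appear in ggplot2 code, the plot below names each one explicitly. The particular choices (the mpg data, the "Set2" palette, facets by drv, theme_minimal()) are illustrative assumptions, not taken from the text.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, colour = class)) +  # 1. data and aesthetic mappings
  geom_point() +                                     # 2. layer: point geom, identity stat
  scale_colour_brewer(palette = "Set2") +            # 3. scale for the colour aesthetic
  coord_cartesian() +                                # 4. coordinate system (the default)
  facet_wrap(~drv) +                                 # 5. faceting specification
  theme_minimal(base_size = 11)                      # 6. theme
p
```

Leaving out the scale, coord, facet and theme lines still gives a valid plot, because ggplot2 supplies a default for each of them.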

The grammar also leaves some things out:

  1. It doesn’t suggest what graphics you should use to answer the questions you are interested in. While this book endeavours to promote a sensible process for producing plots of data, the focus of the book is on how to produce the plots you want, not knowing what plots to produce. For more advice on this topic, you may want to consult Robbins (2013), Cleveland (1993), Chambers et al. (1983), and Tukey (1977).

  2. It does not describe interactivity: the grammar of graphics describes only static graphics and there is essentially no benefit to displaying them on a computer screen as opposed to a piece of paper. ggplot2 can only create static graphics, so for dynamic and interactive graphics you will have to look elsewhere (perhaps at ggvis, described below). Cook and Swayne (2007) provide an excellent introduction to the interactive graphics package GGobi. GGobi can be connected to R with the rggobi package (Wickham et al., 2008).

How Does ggplot2 Fit in with Other R Graphics?

Installation

Getting Started with ggplot2

Introduction

  • The goal of this chapter is to teach you how to produce useful graphics with ggplot2 as quickly as possible. You’ll learn the basics of ggplot() along with some useful “recipes” to make the most important plots. ggplot() allows you to make complex plots with just a few lines of code because it’s based on a rich underlying theory, the grammar of graphics. Here we’ll skip the theory and focus on the practice, and in later chapters you’ll learn how to use the full expressive power of the grammar.

Key Components

  1. Data,

  2. A set of aesthetic mappings between variables in the data and visual properties, and

  3. At least one layer which describes how to render each observation. Layers are usually created with a geom function.

Example

library(ggplot2)
library(dplyr)

ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

  • This produces a scatterplot defined by:
  1. Data: mpg.

  2. Aesthetic mapping: engine size mapped to x position, fuel economy to y position.

  3. Layer: points.

Exercises

library(ggplot2)
library(dplyr)

ggplot(mpg, aes(model, manufacturer)) + geom_point()

library(ggplot2)
library(dplyr)

ggplot(mpg, aes(cty, hwy)) + geom_point()

library(ggplot2)
library(dplyr)

ggplot(diamonds, aes(carat, price)) + geom_point()

library(ggplot2)
library(dplyr)

ggplot(economics, aes(date, unemploy)) + geom_line()

library(ggplot2)
library(dplyr)

ggplot(mpg, aes(cty)) + geom_histogram()

Colour, Size, Shape and Other Aesthetic Attributes

library(ggplot2)
library(dplyr)

ggplot(mpg, aes(displ, cty, colour = class)) + geom_point()

library(ggplot2)
library(dplyr)
library(gridExtra)

g1<-ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
g2<-ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")

grid.arrange(g1,g2,ncol=2)

Faceting

library(ggplot2)
library(dplyr)
library(gridExtra)

ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_wrap(~class)

Example of facet_grid plus regression line (lm)

library(ggplot2)
library(dplyr)
library(gridExtra)

ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_grid(.~class)+geom_smooth(method="lm",se=FALSE)

library(ggplot2)
library(dplyr)
library(gridExtra)

ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_grid(class~.)+geom_smooth(method="lm",se=FALSE)

Adding a Smoother to a Plot

library(ggplot2)
library(dplyr)
library(gridExtra)

ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth()

library(ggplot2)
library(dplyr)
library(gridExtra)

g1<-ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(span = 0.2)

g2<-ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(span = 1)

grid.arrange(g1,g2,ncol=2)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)

ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(method = "gam", formula = y ~ s(x))

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)

ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(method = "lm")

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(method = "rlm")

Boxplots and Jittered Points

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

ggplot(mpg, aes(drv, hwy)) + geom_point()

  1. Jittering, geom_jitter(), adds a little random noise to the data which can help avoid overplotting.

  2. Boxplots, geom_boxplot(), summarise the shape of the distribution with a handful of summary statistics.

  3. Violin plots, geom_violin(), show a compact representation of the “density” of the distribution, highlighting the areas where more points are found.

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

g1<-ggplot(mpg, aes(drv, hwy)) + geom_jitter()
g2<-ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
g3<-ggplot(mpg, aes(drv, hwy)) + geom_violin()

grid.arrange(g1,g2,g3,ncol=3)

Histograms and Frequency Polygons

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

g1<-ggplot(mpg, aes(hwy)) + geom_histogram()
g2<-ggplot(mpg, aes(hwy)) + geom_freqpoly()

grid.arrange(g1,g2,ncol=2)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

g1<-ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 2.5)
g2<-ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 1)

grid.arrange(g1,g2,ncol=2)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

g1<-ggplot(mpg, aes(displ, colour = drv)) + geom_freqpoly(binwidth = 0.5)
g2<-ggplot(mpg, aes(displ, fill = drv)) + geom_histogram(binwidth = 0.5) + facet_wrap(~drv, ncol = 1)

grid.arrange(g1,g2,ncol=2)

Frequency Density

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

g1<-ggplot(mpg, aes(displ, colour = drv)) + geom_density()
g2<-ggplot(mpg, aes(displ, colour = drv)) + geom_density() + facet_wrap(~drv, ncol = 1)
grid.arrange(g1,g2,ncol=2)

g1<-ggplot(mpg, aes(displ, fill = drv)) + geom_density()
g2<-ggplot(mpg, aes(displ, fill = drv)) + geom_density() + facet_wrap(~drv, ncol = 1)
grid.arrange(g1,g2,ncol=2)

Bar Charts

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

ggplot(mpg, aes(manufacturer)) + geom_bar()

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

drugs <- data.frame(
drug = c("a", "b", "c"),
effect = c(4.2, 9.7, 6.1)
)

g1<-ggplot(drugs, aes(drug, effect)) + geom_bar(stat = "identity")
g2<-ggplot(drugs, aes(drug, effect)) + geom_point()

grid.arrange(g1,g2,ncol=2)

Time Series with Line and Path Plots

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

g1<-ggplot(economics, aes(date, unemploy / pop)) + geom_line()
g2<-ggplot(economics, aes(date, uempmed)) + geom_line()

grid.arrange(g1,g2,ncol=2)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

g1<-ggplot(economics, aes(unemploy / pop, uempmed)) + geom_path() + geom_point()

# Helper that extracts the calendar year from a date
year <- function(x) as.POSIXlt(x)$year + 1900

g2<-ggplot(economics, aes(unemploy / pop, uempmed)) + geom_path(colour = "grey50") +
geom_point(aes(colour = year(date)))

grid.arrange(g1,g2,ncol=2)

Modifying the Axes

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

g1<-ggplot(mpg, aes(cty, hwy)) + geom_point(alpha = 1 / 3)

g2<-ggplot(mpg, aes(cty, hwy)) + geom_point(alpha = 1 / 3) + xlab("city driving (mpg)") +
ylab("highway driving (mpg)")

# Remove the axis labels with NULL
g3<-ggplot(mpg, aes(cty, hwy)) + geom_point(alpha = 1 / 3) + xlab(NULL) + ylab(NULL)

grid.arrange(g1,g2,g3,ncol=3)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

g1<-ggplot(mpg, aes(drv, hwy)) + geom_jitter(width = 0.25)

g2<-ggplot(mpg, aes(drv, hwy)) + geom_jitter(width = 0.25) + xlim("f", "r") + ylim(20, 30)

# For continuous scales, use NA to set only one limit
g3<-ggplot(mpg, aes(drv, hwy)) + geom_jitter(width = 0.25, na.rm = TRUE) + ylim(NA, 30)

grid.arrange(g1,g2,g3,ncol=3)

Output

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) + geom_point()

print(p)

summary(p)
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
  class [234x11]
mapping:  x = ~displ, y = ~hwy, colour = ~factor(cyl)
faceting: <ggproto object: Class FacetNull, Facet, gg>
    compute_layout: function
    draw_back: function
    draw_front: function
    draw_labels: function
    draw_panels: function
    finish_data: function
    init_scales: function
    map_data: function
    params: list
    setup_data: function
    setup_params: function
    shrink: TRUE
    train_scales: function
    vars: function
    super:  <ggproto object: Class FacetNull, Facet, gg>
-----------------------------------
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity

  1. Render it on screen with print(). This happens automatically when running interactively, but inside a loop or function, you’ll need to print() it yourself.

  2. Save it to disk with ggsave(), described in Sect. 8.5. For example, ggsave("plot.png", width = 5, height = 5) saves a PNG to disk.

  3. Briefly describe its structure with summary().

  4. Save a cached copy of it to disk with saveRDS(). This saves a complete copy of the plot object, so you can easily re-create it with readRDS(), for example saveRDS(p, "plot.rds") followed by q <- readRDS("plot.rds").
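The four operations listed above can be sketched together as follows. Temporary file paths are used here, as an assumption, so that nothing is written to the working directory; the loop over cty and hwy is only there to show a case where print() is needed.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) + geom_point()

# Inside a loop (or a function) nothing is drawn unless print() is called
for (v in c("cty", "hwy")) {
  print(ggplot(mpg, aes(displ, .data[[v]])) + geom_point() + ylab(v))
}

# Temporary files, so the sketch leaves nothing behind in the project
png_path <- tempfile(fileext = ".png")
rds_path <- tempfile(fileext = ".rds")

ggsave(png_path, plot = p, width = 5, height = 5)  # render the plot to disk
saveRDS(p, rds_path)                               # cache the plot object itself
q <- readRDS(rds_path)                             # restore it later
```

Note the difference between the two files: the PNG is a rendered image, while the RDS file holds the plot object and can be modified further after it is read back.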

Quick Plots

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

g1<-qplot(displ, hwy, data = mpg)
g2<-qplot(displ, data = mpg)

grid.arrange(g1,g2,ncol=2)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

g1<-qplot(displ, hwy, data = mpg, colour = "blue")
g2<-qplot(displ, hwy, data = mpg, colour = I("blue"))

grid.arrange(g1,g2,ncol=2)

Toolbox

Introduction

  • The layered structure of ggplot2 encourages you to design and construct graphics in a structured manner. You’ve learned the basics in the previous chapter, and in this chapter you’ll get a more comprehensive task-based introduction. The goal here is not to exhaustively explore every option of every geom, but instead to show the most important tools for a given task. For more information about individual geoms, along with many more examples illustrating their use, see the documentation. It is useful to think about the purpose of each layer before it is added. In general, there are three purposes for a layer:
  1. To display the data. We plot the raw data for many reasons, relying on our skills at pattern detection to spot gross structure, local structure, and outliers. This layer appears on virtually every graphic. In the earliest stages of data exploration, it is often the only layer.

  2. To display a statistical summary of the data. As we develop and explore models of the data, it is useful to display model predictions in the context of the data. Showing the data helps us improve the model, and showing the model helps reveal subtleties of the data that we might otherwise miss. Summaries are usually drawn on top of the data.

  3. To add additional metadata: context, annotations, and references. A metadata layer displays background context, annotations that help to give meaning to the raw data, or fixed references that aid comparisons across panels. Metadata can be useful in the background and foreground. A map is often used as a background layer with spatial data. Background metadata should be rendered so that it doesn’t interfere with your perception of the data, so is usually displayed underneath the data and formatted so that it is minimally perceptible. That is, if you concentrate on it, you can see it with ease, but it doesn’t jump out at you when you are casually browsing the plot.
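A minimal sketch of the three purposes above in a single plot, assuming the mpg data and a mean reference line as the "metadata" (both are choices made for illustration):

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  # Metadata, added first so it sits underneath the data: a faint reference line
  geom_hline(yintercept = mean(mpg$hwy), colour = "grey70", linetype = "dashed") +
  # The raw data
  geom_point(alpha = 0.5) +
  # A statistical summary, drawn on top of the data
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p
```

Layers are drawn in the order they are added, which is why the reference line comes first and the summary last.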

Basic Plot Types

  1. geom_area() draws an area plot, which is a line plot filled to the y-axis (filled lines). Multiple groups will be stacked on top of each other.

  2. geom_bar(stat = "identity") makes a bar chart. We need stat = "identity" because the default stat automatically counts values (so is essentially a 1d geom, see Sect. 3.11). The identity stat leaves the data unchanged. Multiple bars in the same location will be stacked on top of one another.

  3. geom_line() makes a line plot. The group aesthetic determines which observations are connected; see Sect. 3.5 for more detail. geom_line() connects points from left to right; geom_path() is similar but connects points in the order they appear in the data. Both geom_line() and geom_path() also understand the aesthetic linetype, which maps a categorical variable to solid, dotted and dashed lines.

  4. geom_point() produces a scatterplot. geom_point() also understands the shape aesthetic.

  5. geom_polygon() draws polygons, which are filled paths. Each vertex of the polygon requires a separate row in the data. It is often useful to merge a data frame of polygon coordinates with the data just prior to plotting. Section 3.7 illustrates this concept in more detail for map data.

  6. geom_rect(), geom_tile() and geom_raster() draw rectangles. geom_rect() is parameterised by the four corners of the rectangle, xmin, ymin, xmax and ymax. geom_tile() is exactly the same, but parameterised by the center of the rectangle and its size, x, y, width and height. geom_raster() is a fast special case of geom_tile() used when all the tiles are the same size.
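To make the two rectangle parameterisations concrete, the sketch below draws the same two rectangles once from centres and sizes and once from corners. The data values are made up for illustration.

```r
library(ggplot2)

# Centres and sizes, as geom_tile() expects them (made-up values)
tiles <- data.frame(x = c(1, 3), y = c(1, 2), width = c(1, 2), height = c(1, 1))

# The same rectangles written as corners, as geom_rect() expects them
rects <- data.frame(
  xmin = tiles$x - tiles$width / 2,
  xmax = tiles$x + tiles$width / 2,
  ymin = tiles$y - tiles$height / 2,
  ymax = tiles$y + tiles$height / 2
)

p_tile <- ggplot(tiles, aes(x, y, width = width, height = height)) +
  geom_tile()
p_rect <- ggplot(rects, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax)) +
  geom_rect()
```

Printing p_tile and p_rect should show the same shapes; which form is more convenient depends on whether the data store centres or edges.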

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

df <- data.frame(
x = c(3, 1, 5),
y = c(2, 4, 6),
label = c("a","b","c")
)

p <- ggplot(df, aes(x, y, label = label)) +
labs(x = NULL, y = NULL) + # Hide axis label
theme(plot.title = element_text(size = 12)) # Shrink plot title
g1<-p + geom_point() + ggtitle("point")
g2<-p + geom_text() + ggtitle("text")
g3<-p + geom_bar(stat = "identity") + ggtitle("bar")
g4<-p + geom_tile() + ggtitle("raster")

grid.arrange(g1,g2,g3,g4,ncol=4)

#some more plots
g5<-p + geom_line() + ggtitle("line")
g6<-p + geom_area() + ggtitle("area")
g7<-p + geom_path() + ggtitle("path")
g8<-p + geom_polygon() + ggtitle("polygon")

grid.arrange(g5,g6,g7,g8,ncol=4)

Labels

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

df <- data.frame(x = 1, y = 3:1, family = c("sans", "serif", "mono"))
ggplot(df, aes(x, y)) + geom_text(aes(label = family, family = family))

  1. showtext, https://github.com/yixuan/showtext, by Yixuan Qiu, makes GD-independent plots by rendering all text as polygons.

  2. extrafont, https://github.com/wch/extrafont, by Winston Chang, converts fonts to a standard format that all devices can use.

  1. fontface specifies the face: "plain" (the default), "bold" or "italic".

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

df <- data.frame(x = 1, y = 3:1, face = c("plain", "bold", "italic"))
ggplot(df, aes(x, y)) + geom_text(aes(label = face, fontface = face))

  1. You can adjust the alignment of the text with the hjust ("left", "center", "right", "inward", "outward") and vjust ("bottom", "middle", "top", "inward", "outward") aesthetics. The default alignment is centered. One of the most useful alignments is "inward": it aligns text towards the middle of the plot:

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

df <- data.frame(
x = c(1, 1, 2, 2, 1.5),
y = c(1, 2, 1, 2, 1.5),
text = c(
"bottom-left", "bottom-right",
"top-left", "top-right", "center"
)
)

g1<-ggplot(df, aes(x, y)) + geom_text(aes(label = text))
g2<-ggplot(df, aes(x, y)) + geom_text(aes(label = text), vjust = "inward", hjust = "inward")

grid.arrange(g1,g2,ncol=2)

  1. size controls the font size. Unlike most tools, ggplot2 uses mm, rather than the usual points (pts). This makes it consistent with other size units in ggplot2. (There are 72.27 pts and 25.4 mm in an inch, so to convert from points to mm, multiply by 25.4/72.27.)

  2. angle specifies the rotation of the text in degrees.
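Since one point is 25.4/72.27 mm, a font size given in points can be converted before it is passed to size. The sketch below uses the constant .pt exported by ggplot2 (the number of points per mm); the helper pts_to_mm() and the data frame are hypothetical names introduced only for this example.

```r
library(ggplot2)

# ggplot2 exports .pt, the number of points per millimetre (72.27 / 25.4)
pts_to_mm <- function(pts) pts / .pt

df <- data.frame(
  x = 1,
  y = 3:1,
  label = c("8 pt", "12 pt", "16 pt"),
  pts = c(8, 12, 16)
)

p <- ggplot(df, aes(x, y)) +
  geom_text(aes(label = label, size = pts_to_mm(pts)), angle = 15) +
  scale_size_identity()  # use the computed mm values as they are
p
```

scale_size_identity() is used so that the converted values are taken literally instead of being rescaled by the default size scale.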

  1. Often you want to label existing points on the plot. You don’t want the text to overlap with the points (or bars etc), so it’s useful to offset the text a little. The nudge_x and nudge_y parameters allow you to nudge the text a little horizontally or vertically:

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

df <- data.frame(trt = c("a", "b", "c"), resp = c(1.2, 3.4, 2.5))
ggplot(df, aes(resp, trt)) +
geom_point() +
geom_text(aes(label = paste0("(", resp, ")")), nudge_y = -0.25) +
xlim(1, 3.6)

  1. If check_overlap = TRUE, overlapping labels will be automatically removed. The algorithm is simple: labels are plotted in the order they appear in the data frame; if a label would overlap with a label that has already been drawn, it’s omitted. This is not incredibly useful, but can be handy.

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

g1<-ggplot(mpg, aes(displ, hwy)) + geom_text(aes(label = model)) + xlim(1, 8)

g2<-ggplot(mpg, aes(displ, hwy)) + geom_text(aes(label = model), check_overlap = TRUE) + xlim(1, 8)

grid.arrange(g1,g2,ncol=2)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

label <- data.frame(
waiting = c(55, 80),
eruptions = c(2, 4.3),
label = c("peak one", "peak two")
)

ggplot(faithfuld, aes(waiting, eruptions)) +
geom_tile(aes(fill = density)) +
geom_label(data = label, aes(label = label))

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(directlabels)

g1<-ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point()

g2<-ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point(show.legend = FALSE) +
directlabels::geom_dl(aes(label = class), method = "smart.grid")

grid.arrange(g1,g2,ncol=2)

Annotations

  1. geom_text() to add text descriptions or to label points. Most plots will not benefit from adding text to every single observation on the plot, but labelling outliers and other important points is very useful.

  2. geom_rect() to highlight interesting rectangular regions of the plot. geom_rect() has aesthetics xmin, xmax, ymin and ymax.

  3. geom_line(), geom_path() and geom_segment() to add lines. All these geoms have an arrow parameter, which allows you to place an arrowhead on the line. Create arrowheads with arrow(), which has arguments angle, length, ends and type.

  4. geom_vline(), geom_hline() and geom_abline() allow you to add reference lines (sometimes called rules), that span the full range of the plot.
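A small sketch combining several of the annotation geoms above: a horizontal reference line, an arrow built with arrow(), and a text label. Highlighting the month with the highest unemployment count, and the offsets in days used to place the arrow and label, are choices made for this example only.

```r
library(ggplot2)

# The row with the highest unemployment count, used as the point to highlight
peak <- economics[which.max(economics$unemploy), ]

p <- ggplot(economics, aes(date, unemploy)) +
  geom_line() +
  # Reference line spanning the full width of the plot
  geom_hline(
    yintercept = median(economics$unemploy),
    colour = "grey50", linetype = "dashed"
  ) +
  # Arrow pointing at the peak; offsets are in days because x is a Date
  geom_segment(
    data = peak,
    aes(x = date - 1500, y = unemploy, xend = date - 150, yend = unemploy),
    arrow = arrow(angle = 20, length = unit(0.1, "inches"), ends = "last", type = "closed")
  ) +
  # Label placed at the tail of the arrow
  geom_text(
    data = peak,
    aes(x = date - 1600, y = unemploy, label = "peak"),
    hjust = 1
  )
p
```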

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

ggplot(economics, aes(date, unemploy)) + geom_line()

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

presidential <- subset(presidential, start > economics$date[1])
ggplot(economics) +
geom_rect(
aes(xmin = start, xmax = end, fill = party),
ymin = -Inf, ymax = Inf, alpha = 0.2,
data = presidential
) +
geom_vline(
aes(xintercept = as.numeric(start)),
data = presidential,
colour = "grey50", alpha = 0.5
) +
geom_text(
aes(x = start, y = 2500, label = name),
data = presidential,
size = 3, vjust = 0, hjust = 0, nudge_x = 50
) +
geom_line(aes(date, unemploy)) +
scale_fill_manual(values = c("blue", "red"))

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

yrng <- range(economics$unemploy)
xrng <- range(economics$date)
caption <- paste(strwrap("Unemployment rates in the US have
varied a lot over the years", 40), collapse = "\n")
ggplot(economics, aes(date, unemploy)) +
geom_line() +
geom_text(
aes(x, y, label = caption),
data = data.frame(x = xrng[1], y = yrng[2], caption = caption),
hjust = 0, vjust = 1, size = 4
)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

ggplot(economics, aes(date, unemploy)) +
geom_line() +
annotate("text", x = xrng[1], y = yrng[2], label = caption,
hjust = 0, vjust = 1, size = 4
)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)

ggplot(diamonds, aes(log10(carat), log10(price))) +
geom_bin2d() +
facet_wrap(~cut, nrow = 1)

mod_coef <- coef(lm(log10(price) ~ log10(carat), data = diamonds))
ggplot(diamonds, aes(log10(carat), log10(price))) +
geom_bin2d() +
geom_abline(intercept = mod_coef[1], slope = mod_coef[2],
colour = "white", size = 1) +
facet_wrap(~cut, nrow = 1)

Collective Geoms

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

data(Oxboys)
head(Oxboys)
Grouped Data: height ~ age | Subject
  Subject     age height Occasion
1       1 -1.0000  140.5        1
2       1 -0.7479  143.4        2
3       1 -0.4630  144.8        3
4       1 -0.1643  147.1        4
5       1 -0.0027  147.7        5
6       1  0.2466  150.2        6

Multiple Groups, One Aesthetic

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_point() +
geom_line()

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

ggplot(Oxboys, aes(age, height)) +
geom_point() +
geom_line()

Different Groups on Different Layers

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

ggplot(Oxboys, aes(age, height)) +
geom_line(aes(group = Subject)) +
geom_smooth(method = "lm", size = 2, se = FALSE)

Overriding the Default Grouping

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot()

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot() +
geom_line(colour = "#3366FF", alpha = 0.5)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot() +
geom_line(aes(group = Subject), colour = "#3366FF", alpha = 0.5)

Matching Aesthetics to Graphic Objects

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

df <- data.frame(x = 1:3, y = 1:3, colour = c(1,3,5))
ggplot(df, aes(x, y, colour = factor(colour))) +
geom_line(aes(group = 1), size = 2) +
geom_point(size = 5)

ggplot(df, aes(x, y, colour = colour)) +
geom_line(aes(group = 1), size = 2) +
geom_point(size = 5)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

xgrid <- with(df, seq(min(x), max(x), length = 50))
interp <- data.frame(
x = xgrid,
y = approx(df$x, df$y, xout = xgrid)$y,
colour = approx(df$x, df$colour, xout = xgrid)$y
)
ggplot(interp, aes(x, y, colour = colour)) +
geom_line(size = 2) +
geom_point(data = df, size = 5)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

g1<-ggplot(mpg, aes(class)) +
geom_bar()

g2<-ggplot(mpg, aes(class, fill = drv)) +
geom_bar()

grid.arrange(g1,g2,ncol=2)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

g1<-ggplot(mpg, aes(class, fill = hwy)) +
geom_bar()

g2<-ggplot(mpg, aes(class, fill = hwy, group = hwy)) +
geom_bar()

grid.arrange(g1,g2,ncol=2)

library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

g1<-ggplot(mpg, aes(displ, cty)) + geom_boxplot()

g2<-ggplot(mpg, aes(factor(displ), cty)) + geom_boxplot()

grid.arrange(g1,g2,ncol=2)

Surface Plots

library(ggplot2)
library(dplyr)
library(plotly)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

g1<-ggplot(faithfuld, aes(eruptions, waiting)) + geom_contour(aes(z = density, colour = ..level..))
ggplotly(g1)
g2<-ggplot(faithfuld, aes(eruptions, waiting)) + geom_raster(aes(fill = density))
ggplotly(g2)
grid.arrange(g1,g2,ncol=2)

# Bubble plots work better with fewer observations
small <- faithfuld[seq(1, nrow(faithfuld), by = 10), ]
ggplot(small, aes(eruptions, waiting)) +
geom_point(aes(size = density), alpha = 1/3) +
scale_size_area()

Drawing Maps

Vector Boundaries

  • Vector boundaries are defined by a data frame with one row for each “corner” of a geographical region like a country, state, or county. It requires four variables:
  1. lat and long, giving the location of a point.

  2. group, a unique identifier for each contiguous region.

  3. id, the name of the region.

  • Separate group and id variables are necessary because sometimes a geographical unit isn’t a contiguous polygon. For example, Hawaii is composed of multiple islands that can’t be drawn using a single polygon.

  • The following code extracts that data from the maps package using ggplot2::map_data(). The maps package isn’t particularly accurate or up-to-date, but it’s readily available from CRAN, so it’s a reasonable place to start.

NB: Could not complete this part because some of the packages are not available for version 3.6.1.
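The data structure described above can still be sketched without the maps package by writing the data frame by hand. The coordinates and the region name "atoll" are invented for illustration; real boundary data would come from map_data() or a similar source.

```r
library(ggplot2)

# Made-up coordinates: one named region ("atoll") made of two separate islands
islands <- data.frame(
  long  = c(0, 2, 2, 0,   3, 4, 4, 3),
  lat   = c(0, 0, 1, 1,   2, 2, 3, 3),
  group = rep(c("atoll.1", "atoll.2"), each = 4),  # one value per contiguous polygon
  id    = "atoll"                                   # one value per named region
)

p <- ggplot(islands, aes(long, lat, group = group, fill = id)) +
  geom_polygon(colour = "grey30") +
  coord_quickmap()  # approximate map aspect ratio
p
```

Grouping by group draws two polygons, while filling by id gives both islands the same colour, which is the reason the two variables are kept separate.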

Revealing Uncertainty

  1. Discrete x, range: geom_errorbar(), geom_linerange().

  2. Discrete x, range & center: geom_crossbar(), geom_pointrange().

  3. Continuous x, range: geom_ribbon().

  4. Continuous x, range & center: geom_smooth(stat = "identity").

library(ggplot2)
library(dplyr)
library(plotly)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

y <- c(18, 11, 16)
df <- data.frame(x = 1:3, y = y, se = c(1.2, 0.5, 1.0))
base <- ggplot(df, aes(x, y, ymin = y - se, ymax = y + se))
g1<-base + geom_crossbar()
g2<-base + geom_pointrange()
g3<-base + geom_smooth(stat = "identity")
g4<-base + geom_errorbar()
g5<-base + geom_linerange()
g6<-base + geom_ribbon()

grid.arrange(g1,g2,g3,g4,g5,g6,ncol=3,nrow=2)

Weighted Data

  1. Nothing, to look at numbers of counties.

  2. Total population, to work with absolute numbers.

  3. Area, to investigate geographic effects. (This isn’t useful for midwest, but would be if we had variables like percentage of farmland.)

library(ggplot2)
library(dplyr)
library(plotly)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

# Unweighted
g1<-ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point()

# Weight by population
g2<-ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point(aes(size = poptotal / 1e6)) +
scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))

grid.arrange(g1,g2,ncol=2)

library(ggplot2)
library(dplyr)
library(plotly)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

# Unweighted
g1<-ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point() +
geom_smooth(method = lm, size = 1)

# Weighted by population
g2<-ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point(aes(size = poptotal / 1e6)) +
geom_smooth(aes(weight = poptotal), method = lm, size = 1) +
scale_size_area(guide = "none")

grid.arrange(g1,g2,ncol=2)

library(ggplot2)
library(dplyr)
library(plotly)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)

g1<-ggplot(midwest, aes(percbelowpoverty)) +
geom_histogram(binwidth = 1) +
ylab("Counties")

g2<-ggplot(midwest, aes(percbelowpoverty)) +
geom_histogram(aes(weight = poptotal), binwidth = 1) +
ylab("Population (1000s)")

grid.arrange(g1,g2,ncol=2)

Diamonds Data

Displaying Distributions

  • There are a number of geoms that can be used to display distributions, depending on the dimensionality of the distribution, whether it is continuous or discrete, and whether you are interested in the conditional or joint distribution.

  • For 1d continuous distributions the most important geom is the histogram, geom_histogram():

library(ggplot2)
library(dplyr)
library(plotly)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)
data(diamonds)

g1<-ggplot(diamonds, aes(depth)) + geom_histogram()

g2<-ggplot(diamonds, aes(depth)) + geom_histogram(binwidth = 0.1) + xlim(55, 70)

grid.arrange(g1,g2,ncol=2)

  • It is important to experiment with binning to find a revealing view. You can change the binwidth, specify the number of bins, or specify the exact location of the breaks. Never rely on the default parameters to get a revealing view of the distribution. Zooming in on the x axis, xlim(55, 70), and selecting a smaller bin width, binwidth = 0.1, reveals far more detail.

  • When publishing figures, don’t forget to include information about important parameters (like bin width) in the caption.
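The three ways of controlling the binning mentioned above correspond to the binwidth, bins and breaks arguments of geom_histogram(). The specific values below are arbitrary choices for illustration, as is the caption text.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(depth))

# Three ways to control binning
p_width  <- base + geom_histogram(binwidth = 0.5)                    # fixed bin width
p_bins   <- base + geom_histogram(bins = 50)                         # fixed number of bins
p_breaks <- base + geom_histogram(breaks = seq(55, 70, by = 0.25))   # exact break locations

# Record the chosen bin width where readers will see it
p_final <- p_width + labs(caption = "Bin width: 0.5")
```

With explicit breaks, observations outside the range of the breaks are not counted, so this form also acts as a crude zoom.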

  • If you want to compare the distribution between groups, you have a few options:

  1. Show small multiples of the histogram, facet_wrap(~ var).

  2. Use colour and a frequency polygon, geom_freqpoly().

  3. Use a “conditional density plot”, geom_histogram(position = "fill").

  • The frequency polygon and conditional density plots are shown below. The conditional density plot uses position_fill() to stack each bin, scaling it to the same height. This plot is perceptually challenging because you need to compare bar heights, not positions, but you can see the strongest patterns.

library(ggplot2)
library(dplyr)
library(plotly)
library(gridExtra)
library(mgcv)
library(MASS)
library(nlme)
data(diamonds)

g1<-ggplot(diamonds, aes(depth)) +
geom_freqpoly(aes(colour = cut), binwidth = 0.1, na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")

g2<-ggplot(diamonds, aes(depth)) +
geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill",
na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")

grid.arrange(g1,g2,ncol=2)

  • (I’ve suppressed the legends to focus on the display of the data.)

  • Both the histogram and frequency polygon geoms use the same underlying statistical transformation: stat = "bin". This statistic produces two output variables: count and density. By default, count is mapped to y-position, because it’s most interpretable. The density is the count divided by the product of the total count and the bin width, and is useful when you want to compare the shape of the distributions, not the overall size.
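The relationship between count and density can be checked directly on the values computed by the stat. This is a sketch using layer_data() to inspect them; the bin width of 0.5 is an arbitrary choice, and the last plot assumes ggplot2 3.3.0 or later, where after_stat() is available.

```r
library(ggplot2)

binwidth <- 0.5
p <- ggplot(diamonds, aes(depth)) + geom_histogram(binwidth = binwidth)

# layer_data() returns the per-bin values computed by the stat
bins <- layer_data(p)

# Density recomputed by hand: count / (total count * bin width)
by_hand <- bins$count / (sum(bins$count) * binwidth)
isTRUE(all.equal(bins$density, by_hand))

# Mapping density instead of count to y (after_stat() assumes ggplot2 >= 3.3.0)
p_density <- ggplot(diamonds, aes(depth, colour = cut)) +
  geom_freqpoly(aes(y = after_stat(density)), binwidth = binwidth)
```

Because density is computed within each group, the curves in p_density each cover the same area, which makes their shapes comparable even when the groups differ a lot in size.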

  • An alternative to a bin-based visualisation is a density estimate. geom_density() places a little normal distribution at each data point and sums up all the curves. It has desirable theoretical properties, but is more difficult to relate back to the data. Use a density plot when you know that the underlying density is smooth, continuous and unbounded. You can use the adjust parameter to make the density more or less smooth.

library(ggplot2)
library(gridExtra)

# Density estimate for all diamonds
g1 <- ggplot(diamonds, aes(depth)) +
  geom_density(na.rm = TRUE) +
  xlim(58, 68) +
  theme(legend.position = "none")

# One semi-transparent density estimate per cut
g2 <- ggplot(diamonds, aes(depth, fill = cut, colour = cut)) +
  geom_density(alpha = 0.2, na.rm = TRUE) +
  xlim(58, 68) +
  theme(legend.position = "none")

grid.arrange(g1, g2, ncol = 2)

  • Note that the area of each density estimate is standardised to one so that you lose information about the relative size of each group.
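
If the relative size of each group matters, one option is to map the count variable that the density stat also computes (density multiplied by the number of observations). A minimal sketch, assuming ggplot2 3.3.0 or later for after_stat(); the name p_count is illustrative.

```r
library(ggplot2)

# Scale each curve by its group size so larger groups get larger areas
p_count <- ggplot(diamonds, aes(depth, fill = cut, colour = cut)) +
  geom_density(aes(y = after_stat(count)), alpha = 0.2, na.rm = TRUE) +
  xlim(58, 68)

p_count
```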

  • The histogram, frequency polygon and density display a detailed view of the distribution. However, sometimes you want to compare many distributions, and it’s useful to have alternative options that sacrifice quality for quantity. Here are three options:

  1. geom_boxplot(): the box-and-whisker plot shows five summary statistics along with individual “outliers”. It displays far less information than a histogram, but also takes up much less space. You can use boxplot with both categorical and continuous x. For continuous x, you’ll also need to set the group aesthetic to define how the x variable is broken up into bins. A useful helper function is cut_width():
library(ggplot2)
library(gridExtra)

# Categorical x: one box per clarity level
g1 <- ggplot(diamonds, aes(clarity, depth)) +
  geom_boxplot()

# Continuous x: group carat into bins of width 0.1, one box per bin
g2 <- ggplot(diamonds, aes(carat, depth)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1))) +
  xlim(NA, 2.05)

grid.arrange(g1, g2, ncol = 2)
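
To see what cut_width() produces, it can help to call it outside a plot. The sketch below only inspects the result; the name carat_bins is illustrative.

```r
library(ggplot2)

# cut_width() turns a continuous vector into a factor of equal-width intervals
carat_bins <- cut_width(diamonds$carat, width = 0.1)

head(levels(carat_bins))  # interval labels, one per bin
table(carat_bins)[1:5]    # number of diamonds in the first few bins
```

Each level of this factor becomes one group, and so one box, in the plot.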

  2. geom_violin(): the violin plot is a compact version of the density plot. The underlying computation is the same, but the results are displayed in a similar fashion to the boxplot:
library(ggplot2)
library(gridExtra)

# Categorical x: one violin per clarity level
g1 <- ggplot(diamonds, aes(clarity, depth)) +
  geom_violin()

# Continuous x: group carat into bins of width 0.1, one violin per bin
g2 <- ggplot(diamonds, aes(carat, depth)) +
  geom_violin(aes(group = cut_width(carat, 0.1))) +
  xlim(NA, 2.05)

grid.arrange(g1, g2, ncol = 2)

  3. geom_dotplot(): draws one point for each observation, carefully adjusted in space to avoid overlaps and show the distribution. It is useful for smaller datasets.
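
A small sketch of a dot plot follows. It uses the built-in mtcars data (32 rows) rather than diamonds, since the geom suits small datasets; the bin width of 1.5 and the name p_dots are illustrative choices.

```r
library(ggplot2)

# One dot per car, stacked within bins of width 1.5 mpg
p_dots <- ggplot(mtcars, aes(mpg)) +
  geom_dotplot(binwidth = 1.5) +
  scale_y_continuous(NULL, breaks = NULL)  # hide the y axis, which is not meaningful here

p_dots
```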

Dealing with Overplotting

  1. Very small amounts of overplotting can sometimes be alleviated by making the points smaller, or using hollow glyphs. The following code shows some options for 2000 points sampled from a bivariate normal distribution.
library(ggplot2)
library(gridExtra)

# 2000 points from a bivariate normal distribution
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)

g1 <- norm + geom_point()
g2 <- norm + geom_point(shape = 1)    # Hollow circles
g3 <- norm + geom_point(shape = ".")  # Pixel sized

grid.arrange(g1, g2, g3, ncol = 3)

  2. For larger datasets with more overplotting, you can use alpha blending (transparency) to make the points transparent. If you specify alpha as a ratio, the denominator gives the number of points that must be overplotted to give a solid colour. Values smaller than ~1/500 are rounded down to zero, giving completely transparent points.
library(ggplot2)
library(gridExtra)

# Reuses `norm` from the previous block
g1 <- norm + geom_point(alpha = 1 / 3)
g2 <- norm + geom_point(alpha = 1 / 5)
g3 <- norm + geom_point(alpha = 1 / 10)

grid.arrange(g1, g2, g3, ncol = 3)

  3. If there is some discreteness in the data, you can randomly jitter the points to alleviate some overlaps with geom_jitter(). This can be particularly useful in conjunction with transparency. By default, the amount of jitter added is 40% of the resolution of the data, which leaves a small gap between adjacent regions. You can override the default with width and height arguments.
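
Jittering combined with transparency might look like the sketch below. It uses the mpg data shipped with ggplot2, where city and highway mileage are recorded as whole numbers; the jitter amounts and the name p_jitter are illustrative.

```r
library(ggplot2)

set.seed(1)  # jitter is random; fix the seed for a reproducible figure

# Shift each point by up to 0.25 units in each direction and use
# transparency so remaining overlaps show up as darker areas
p_jitter <- ggplot(mpg, aes(cty, hwy)) +
  geom_jitter(width = 0.25, height = 0.25, alpha = 1 / 3)

p_jitter
```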

  4. Bin the points and count the number in each bin, then visualise that count (the 2d generalisation of the histogram) with geom_bin2d(). Breaking the plot into many small squares can produce distracting visual artefacts. D. B. Carr et al. (1987) suggest using hexagons instead, and this is implemented in geom_hex(), using the hexbin package (D. Carr et al., 2015). The code below compares square and hexagonal bins, using parameters bins and binwidth to control the number and size of the bins.
library(ggplot2)
library(gridExtra)

# Reuses `norm` from the earlier block; geom_hex() needs the hexbin package
g1 <- norm + geom_bin2d()
g2 <- norm + geom_bin2d(bins = 10)
g3 <- norm + geom_hex()
g4 <- norm + geom_hex(bins = 10)

grid.arrange(g1, g2, g3, g4, ncol = 2, nrow = 2)
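
The binwidth parameter sets the size of the bins directly instead of their number. The sketch below is self-contained, so it regenerates similar random data under the illustrative name df_demo; the width of 0.5 is an arbitrary choice.

```r
library(ggplot2)

set.seed(1)
df_demo <- data.frame(x = rnorm(2000), y = rnorm(2000))

# bins fixes how many bins span each axis
p_bins <- ggplot(df_demo, aes(x, y)) + geom_bin2d(bins = 10)

# binwidth fixes the size of each bin in data units (x width, y width)
p_width <- ggplot(df_demo, aes(x, y)) + geom_bin2d(binwidth = c(0.5, 0.5))

p_bins
p_width
```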

  5. Estimate the 2d density with stat_density2d(), and then display using one of the techniques for showing 3d surfaces in Sect. 3.6.

  6. If you are interested in the conditional distribution of y given x, then the techniques of Sect. 2.6.3 will also be useful.
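
For the 2d density estimate, two possible displays are contour lines and a filled raster. This is a sketch rather than the approach of Sect. 3.6: it regenerates random data under the illustrative name df_demo, assumes ggplot2 3.3.0 or later for after_stat(), and relies on the MASS package for the estimate.

```r
library(ggplot2)

set.seed(1)
df_demo <- data.frame(x = rnorm(2000), y = rnorm(2000))

# Contour lines of the estimated 2d density
p_contour <- ggplot(df_demo, aes(x, y)) + geom_density_2d()

# The same estimate drawn as a raster, with density mapped to fill
p_raster <- ggplot(df_demo, aes(x, y)) +
  stat_density_2d(aes(fill = after_stat(density)),
                  geom = "raster", contour = FALSE)

p_contour
p_raster
```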

Statistical Summaries

library(ggplot2)
library(gridExtra)

# Number of diamonds of each colour
g1 <- ggplot(diamonds, aes(color)) + geom_bar()

# Mean price for each colour; `fun` replaces `fun.y`, deprecated in ggplot2 3.3.0
g2 <- ggplot(diamonds, aes(color, price)) +
  geom_bar(stat = "summary_bin", fun = mean)

grid.arrange(g1, g2, ncol = 2)

library(ggplot2)
library(gridExtra)

# Number of diamonds in each 1 x 1 bin of table and depth
g1 <- ggplot(diamonds, aes(table, depth)) +
  geom_bin2d(binwidth = 1, na.rm = TRUE) +
  xlim(50, 70) +
  ylim(50, 70)

# Mean price in each 1 x 1 bin of table and depth
g2 <- ggplot(diamonds, aes(table, depth, z = price)) +
  geom_raster(binwidth = 1, stat = "summary_2d", fun = mean,
              na.rm = TRUE) +
  xlim(50, 70) +
  ylim(50, 70)

grid.arrange(g1, g2, ncol = 2)

Add-on Packages

  1. animInt, https://github.com/tdhock/animint, lets you make your ggplot2 graphics interactive, adding querying, filtering and linking.

  2. GGally, https://github.com/ggobi/ggally, provides a very flexible scatterplot matrix, amongst other tools.

  3. ggbio, http://www.tengfei.name/ggbio/, provides a number of specialised geoms for genomic data.

  4. ggdendro, https://github.com/andrie/ggdendro, turns data from tree methods into data frames that can easily be displayed with ggplot2.

  5. ggfortify, https://github.com/sinhrks/ggfortify, provides fortify and autoplot methods to handle objects from some popular R packages.

  6. ggenealogy, https://cran.r-project.org/package=ggenealogy, helps explore and visualise genealogy data.

  7. ggmcmc, http://xavier-fim.net/packages/ggmcmc/, provides a set of flexible tools for visualising the samples generated by MCMC methods.

  8. ggparallel, https://cran.r-project.org/package=ggparallel: easily draw parallel coordinates plots, and the closely related hammock and common angle plots.

  9. ggtern, http://www.ggtern.com, lets you use ggplot2 to draw ternary diagrams, used when you have three variables that always sum to one.

  10. ggtree, https://github.com/GuangchuangYu/ggtree, provides tools to view and annotate phylogenetic trees with different types of meta-data.

  11. granovaGG, https://github.com/briandk/granovaGG, provides tools to visualise ANOVA results.

  12. plotluck, https://github.com/stefan-schroedl/plotluck: the ggplot2 version of Google’s “I’m feeling lucky”. It automatically creates plots for one, two or three variables.

References